Basics in R Package Building: Human-Centered Problems and Solutions

Elisabeth Dahlqwist & Nissa Ferm
R-Ladies Stockholm on October 23, 2019

Elisabeth Dahlqwist

  • PhD in Biostatistics from Karolinska Institutet (KI). Currently working as a methods statistician at Statistics Sweden (SCB).

  • Interested in causal inference and data quality.

  • Author of the “AF” package for estimating the attributable fraction.

Nissa Ferm

  • Recent transplant to Stockholm.
  • Government fisheries researcher turned data scientist.
  • Built the R packages FastrCAT, rrza, and fishgutr.
  • Made my first PR during Tidyverse Dev Day on the dplyr package 🎉.
  • I also love crafting and sea critters 😍!

R Packages on CRAN

As of this year

  • 15007 on CRAN

  • ~ 2116 on GitHub (includes dev versions of CRAN Packages)

  • 1741 on BioConductor

  • ~ 18864 + known packages! 🤯

So many packages, why make more?

  • A 📦 can be just for you

  • A 📦 can be specific to your use case

  • 📦's are easily shareable

  • 📦's are great for method development, easy documentation

  • 📦's encapsulate a project, all files in one place!

Elisabeth: why I built a package

  • We recognized the need to implement (and develop) epidemiological methods in R in order to make them easier to use.

  • This made us think about how we could format our package to make it user-friendly.

Nissa: why I built a package

  • Data was trapped in a particular type of oceanographic file

  • Thousands of these files were created each field season

  • Past methods meant data wasn't available for more than a year

  • I wanted to use the data while still out on the research cruise 🚢 📈!

So, I built FastrCAT

FastrCAT…

  • streamlined the data acquisition
  • removed the wait of a year or more for the data
  • has functions to produce maps, plots and reports
  • was used successfully this past field season!

Today's Journey

Basics

  • style
  • readable code

Package design decision examples:

  • documentation
  • dependencies
  • user interaction

First Steps: Picking a Style

What is human readable code?

Wikipedia defines human readable as,

“A human-readable medium or human-readable format is a representation of data or information that can be naturally read by humans.”


As humans we are great at telling stories, and your code is a narrative.

Readable Code Basics

  • meaningful file names, w/o spaces
arctic_fish_data_clean.csv
  • not
finalFINAL_final I really mean it pleaseBE done.csv
  • objects or variables are nouns
friend_group <- c("Dominique", "Hollis", "Sam", "Robyn", "Ridley") 
acorn_count <- c(1, 3, 5, 7, 2, 0)

Readable Code Basics

  • functions are verbs, actions
who_called <- function(friend_group){"Who called me?"}
feed_the_squirrel <- function(acorn_count){"Do I have enough acorns to feed each squirrel?"}
  • when naming objects or functions use snake_case 🐍_🐍 or CamelCase 🐫🐫
  • do not use dot.case
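
One reason to avoid dot.case is that R's S3 system names its methods generic.class, so a dotted name can be read as a method. Below is a minimal sketch of that overlap. The "squirrel" class and both function names are made up for illustration.

```r
# Hypothetical S3 class "squirrel", used only for illustration.
chip <- structure(list(name = "Chip", acorn_count = 3), class = "squirrel")

# R treats this name as "the summary method for class squirrel",
# because S3 methods are named generic.class.
summary.squirrel <- function(object, ...) {
  paste(object$name, "has", object$acorn_count, "acorns")
}

# snake_case keeps an ordinary function clearly separate from S3 methods.
count_acorns <- function(squirrel) {
  squirrel$acorn_count
}

summary(chip)       # dispatches to summary.squirrel()
count_acorns(chip)  # a plain function call, no dispatch involved
```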

Documentation

We all love to write documentation 😐

  • R package development methods force you to document
  • You can tailor documentation relative to your audience
  • Documentation comes at many levels
    • variable
    • function
    • package
    • vignette / walk through example
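
Function-level documentation is often written with roxygen2 comments placed directly above the function. The sketch below shows what that could look like. It assumes roxygen2 is used, and the two-argument feed_the_squirrel() here is a hypothetical variant written only for this example.

```r
#' Check whether there are enough acorns
#'
#' @param acorn_count Numeric vector with the number of acorns in each stash.
#' @param squirrels Single number, how many squirrels need feeding.
#' @return `TRUE` if the total number of acorns covers every squirrel.
#' @examples
#' feed_the_squirrel(c(1, 3, 5), squirrels = 4)
#' @export
feed_the_squirrel <- function(acorn_count, squirrels) {
  # one acorn per squirrel is assumed for this illustration
  sum(acorn_count) >= squirrels
}

feed_the_squirrel(c(1, 3, 5), squirrels = 4)
```

The `#'` lines are ordinary comments to R, so the function runs as usual. roxygen2 turns them into the help page.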

Belt and braces (“Både hängsle och livrem”)

  • Do not confuse your users! As far as possible, use the same argument names as the packages you build upon, or that your users are already familiar with.
  • Give reproducible examples for the different functionalities of your function and write down definitions.
  • Complement the help pages by describing your method in a journal article or a vignette.

Writing Error Messages

  • error writing is a form of documentation
  • understand how and why errors occur
  • write clear and unambiguous directions
  • use the appropriate level of language
  • and be positive!
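
As a rough sketch of these points, the function below stops with a message that says what was expected, what was received, and what to do next. The function name and the file-extension check are hypothetical and not taken from FastrCAT.

```r
# Hypothetical checker; the name and the ".up" extension test are illustrative.
check_up_file <- function(path) {
  if (!grepl("\\.up$", path)) {
    stop(
      "`path` must point to a .up file, but got '", path, "'.\n",
      "Please check the file name and try again.",
      call. = FALSE
    )
  }
  invisible(path)
}

# tryCatch() lets us look at the message without halting the script.
msg <- tryCatch(check_up_file("cast_001.txt"), error = conditionMessage)
cat(msg, "\n")
```

Setting `call. = FALSE` leaves the internal function call out of the message, which keeps it readable for users who do not know R well.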

How the error occurred

flowchart

  • error occurred at step one

  • error was common, I did it too!

  • if step one did not occur, .up files would be missing header information

  • before the package, the error might go undetected for a year

  • by that point, fixing the error had become complicated

Handling the error message

# Determines if file has header information -----------------------------------
    if(length(grep("@ ", full_table, ignore.case = TRUE)) == 0){

      # build the message once so the report and the console say the same thing
      no_head_msg <- paste("This file", temp[i],
                           "has no header info, needs to be reprocessed.",
                           sep = " ")

      # keep the message for the report table
      no_head_files[[i]] <- no_head_msg

      warning(no_head_msg)

      # skip to the next file
      next

    }else{
  • the conditional statement checks for a header in a single file
  • if no header info is found, it writes a warning to the console
  • it also builds a table of all files without headers to write into the report

Handling the error message

  # If no files were flagged, report success; otherwise list the flagged files
  No_head_files <- if(is.null(no_head_files)){

    "All header information entered into MasterCOD. High Five!"

  } else {

    data.frame(unlist(no_head_files))
  }
  • Additional conditional determines if error table has errors
  • This will be written to the report
  • Add positive reinforcement!

Careful with your dependencies

Dependencies are the other packages your package needs to run. Ask yourself these questions:

  • Does the size of your package matter?
  • How much complexity is needed?
  • Do your users understand the packages you are adding as dependencies?
  • If the other package's function is the best, is there a reason to rewrite it?

It Depends

  • Be aware of the difference between the types of dependencies: Depends | Imports | Suggests
  • Do not use Depends!
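
In code, the difference shows up roughly like this: packages listed under Imports are installed together with yours and are called as pkg::fun(), while packages under Suggests are optional and should be checked for before use. The helper below is a hypothetical sketch of that check, with ggplot2 standing in as the optional package.

```r
# Hypothetical helper: uses an optional (Suggests) package only if installed.
describe_backend <- function(pkg = "ggplot2") {
  # requireNamespace() returns TRUE/FALSE and does not attach the package
  if (requireNamespace(pkg, quietly = TRUE)) {
    paste("using", pkg)
  } else {
    "falling back to base graphics"
  }
}

describe_backend()         # result depends on whether ggplot2 is installed
describe_backend("stats")  # "stats" ships with R, so it is always found
```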

Making maps

  • Maps were requested after all the plot functions were added.

  • There were limitations for choosing a mapping package
    • no internet, or only a basic connection, out at sea
    • so ggmap, which needs an internet connection, was out
  • To move forward, I needed to bundle our map files within the package
    • the package was going to be large
    • and full of dependencies

Making maps

Mapping function dependencies

  • sf: Simple Features
  • sp: Classes and Methods for Spatial Data
  • gstat: Spatial and Spatio-Temporal Geostatistical Modelling, Prediction and Simulation
  • raster: Geographic Data Analysis and Modeling
  • ggplot2: data visualization

Making maps

  • How the base map is brought into the function
# bring in the shape files to make the basemap --------------------------------

MAP <- sf::st_read(dsn = system.file("extdata", package = "FastrCAT"),
                   layer = "Alaska_dcw_polygon_Project", quiet = TRUE)
  • This part tells the function where to find the mapping files in your package library
system.file("extdata", package = "FastrCAT")
  • When you are writing your package you place the files in the inst/extdata folder
  • One caveat is that the files are not hidden.
  • For other file types you could store the data in R/sysdata.rda instead, which hides the original files.
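
system.file() returns an empty string when a file or package cannot be found, so it can be worth wrapping the lookup in a small check. The sketch below uses a hypothetical helper name, and the "stats" package stands in for FastrCAT so the example runs anywhere.

```r
# Hypothetical helper: look up a file shipped with a package and fail clearly.
find_pkg_file <- function(..., package) {
  path <- system.file(..., package = package)
  # system.file() returns "" when the file or package cannot be found
  if (!nzchar(path)) {
    stop("Could not find the requested file in package '", package, "'.",
         call. = FALSE)
  }
  path
}

# "stats" ships with R, so this runs without FastrCAT installed.
find_pkg_file("DESCRIPTION", package = "stats")
```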

Human centered interactions

“Good design is actually a lot harder to notice than poor design, in part because good designs fit our needs so well that the design is invisible, serving us without drawing attention to itself. Bad design, on the other hand, screams out its inadequacies, making itself very noticeable.”

Donald A. Norman, The Design of Everyday Things

Formatting the output

  • Use S3 or S4 classes in R to format the output so that your users easily understand it.
  • Your users may want loads of information from your package, but it is not very useful to print it all at once.
  • Example: glm. A fitted model holds a lot of output, but only selected information is shown when you call print(fit) or summary(fit). This is what “object oriented” programming with S3 and S4 classes is for.
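
A minimal S3 sketch of this idea: the object stores everything, but its print method shows only the headline numbers. The class name, constructor and fields are hypothetical, and this is not how the AF package itself is implemented.

```r
# Hypothetical constructor: keeps everything, but tags the result with a class.
new_af_result <- function(estimate, std_error, data) {
  structure(
    list(estimate = estimate, std_error = std_error, data = data),
    class = "af_result"
  )
}

# print() method: shows only the headline numbers, not the stored data.
print.af_result <- function(x, ...) {
  cat("Estimate:  ", round(x$estimate, 3), "\n")
  cat("Std. error:", round(x$std_error, 3), "\n")
  invisible(x)
}

# mtcars is only a stand-in data set for the illustration.
res <- new_af_result(0.1664, 0.0303, data = mtcars)
res          # auto-prints via print.af_result()
names(res)   # the full contents are still there for users who want them
```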

Formatting the output

Summary of glm object

> summary(fit)

Call:
glm(formula = Y ~ X + Z + X * Z, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4742  -0.8863   0.4277   0.8371   2.3265  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.1806     0.1079  -1.674   0.0941 .  
X             1.0075     0.1544   6.525  6.8e-11 ***

Summary of AFglm object

> summary(AFglm_est)
Call:  
AFglm(object = fit, data = data, exposure = "X")

Estimated attributable fraction (AF) and untransformed 95% Wald CI: 

        AF  Std.Error  z value     Pr(>|z|) Lower limit Upper limit
 0.1664325 0.03027685 5.497019 3.862643e-08   0.1070909    0.225774

User Workflow

  • Two kinds of users: those who knew R and those who did not
  • I built the workflow around the non-R users' needs, which required:
    • determining barriers and following a path of least resistance
    • matching the workflow to running a Perl script in the terminal
    • secretly having them open an R session in the terminal
    • giving explicit instructions/examples for running functions in the session

Package it all up🎉

Today we learned

  • We can have different experiences/reasons to build packages

  • Meet your users where they are at and support them

  • You should Document, Document, Document… Document

  • When you are ready to build a package there is a whole community to support you.

Package building resources

Writing an R package from scratch, Hilary Parker of Not So Standard Deviations
https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/

R Packages by Hadley Wickham
http://r-pkgs.had.co.nz/

R Packages: The Whole Game by Jenny Bryan
https://r-pkgs.org/whole-game.html also see https://stat545.com/

R package primer by Karl Broman
https://kbroman.org/pkg_primer/

Thank you!🎉

Find all the slides and code here https://github.com/R-Ladies-Stockholm/Package-Basics-Presentation

Follow us on Twitter @RLadiesSthlm or Facebook @RLadiesStockholm

Coming in the near future: 5-minute favorite package talks and a package building workshop!

And now for a quick package building demo.